---
id: evaluation-end-to-end-llm-evals
title: End-to-End LLM Evaluation
sidebar_label: End-to-End Evals
---

import Tabs from "@theme/Tabs";
import TabItem from "@theme/TabItem";
import { Timeline, TimelineItem } from "@site/src/components/Timeline";
import VideoDisplayer from "@site/src/components/VideoDisplayer";

End-to-end evaluation assesses the "observable" inputs and outputs of your LLM application - it is what users see, and treats your LLM application as a black-box.

![end-to-end evals](https://deepeval-docs.s3.us-east-1.amazonaws.com/docs:end-to-end-llm-evals.png)

<details>
<summary><strong>When should you run End-to-End evaluations?</strong></summary>

For simple LLM applications like basic RAG pipelines with "flat" architectures
that can be represented by a single <code>LLMTestCase</code>, end-to-end
evaluation is ideal. Common use cases that are suitable for end-to-end
evaluation include (not inclusive):

- RAG QA
- PDF extraction
- Writing assitants
- Summarization
- etc.

You'll notice that use cases with simplier architectures are more suited for end-to-end evaluation. However, if your system is an extremely complex agentic workflow, you might also find end-to-end evaluation more suitable as you'll might conclude that that component-level evaluation gives you too much noise in its evaluation results.

Most of what you saw in DeepEval's <a href="/docs/getting-started">quickstart</a> is end-to-end evaluation!

</details>

## What Are E2E Evals

Running an end-to-end LLM evaluation creates a **test run** — a collection of test cases that benchmarks your LLM application at a specific point in time. You would typically:

- Loop through a list of `Golden`s
- Invoke your LLM app with each golden's `input`
- Generate a set of test cases ready for evaluation
- Apply metrics to your test cases and run evaluations

:::info
To get a more fully sharable [LLM test report](https://www.confident-ai.com/docs/llm-evaluation/dashboards/testing-reports) login to Confident AI [here](https://app.confident-ai.com) or run the following in your terminal:

```bash
deepeval login
```

:::

## Setup Your Test Environment

<Timeline>
<TimelineItem title="Create a dataset">

[Datasets](/docs/evaluation-datasets) in `deepeval` allow you to store [`Golden`](/docs/evaluation-datasets#what-are-goldens)s, which are like a precursors to test cases. They allow you to create test case dynamically during evaluation time by calling your LLM application. Here's how you can create goldens:

<Tabs groupId="single-multi-turns">
<TabItem label="Single-Turn" value="single-turn">

```python
from deepeval.dataset import Golden

goldens=[
    Golden(input="What is your name?"),
    Golden(input="Choose a number between 1 to 100"),
    ...
]
```

</TabItem>
<TabItem label="Multi-Turn" value="multi-turn">

```python
from deepeval.dataset import ConversationalGolden

goldens = [
    ConversationalGolden(
        scenario="Andy Byron wants to purchase a VIP ticket to a Coldplay concert.",
        expected_outcome="Successful purchase of a ticket.",
        user_description="Andy Byron is the CEO of Astronomer.",
    ),
    ...
]
```

</TabItem>
</Tabs>

You can also generate synthetic goldens automatically using the `Synthesizer`. Learn more [here](/docs/synthesizer-introduction). You can now use these goldens to create an evaluation dataset that can be stored and loaded them anytime.

Here's an example showing how you can create and store datasets in `deepeval`:

<Tabs>
<TabItem value="confident-ai" label="Confident AI">

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset(goldens)
dataset.push(alias="My dataset")
```

</TabItem>
<TabItem value="csv" label="Locally as CSV">

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset(goldens)
dataset.save_as(
    file_type="csv",
    directory="./example"
)
```

</TabItem>
<TabItem value="json" label="Locally as JSON">

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset(goldens)
dataset.save_as(
    file_type="json",
    directory="./example"
)
```

</TabItem>
</Tabs>

✅ Done. You can now use this dataset anywhere to run your evaluations automatically by looping over them and generating test cases.

</TimelineItem>
<TimelineItem title="Select metrics">

When it comes to selecting metrics for your application, we recommend choosing no more than 5 metrics, comprising of:

- (2 - 3) **Generic metrics** for your application type. (_e.g. Agents, RAG, Chabot_)
- (1 - 2) **Custom metrics** for your specific use case.

You can read our [metrics section](/docs/metrics-introduction) to learn about the 50+ metrics we offer. Or come to [our discord](https://discord.com/invite/a3K9c8GRGt) and get some tailored recommendations from our team.

</TimelineItem>
</Timeline>

You can now use these test cases and metrics to run [single-turn](#single-turn-end-to-end-evals) and [multi-turn](#multi-turn-end-to-end-evals) end-to-end evals. If you've setup [tracing](/docs/evaluation-llm-tracing) for your LLM application, you can automatically [run end-to-end evals for traces](#end-to-end-evals-for-tracing) using a single line of code.

## Single-Turn E2E Evals

<Timeline>
<TimelineItem title="Load your dataset">

`deepeval` offers support for loading datasets stored in JSON files, CSV files, and hugging face datasets into an `EvaluationDataset` as either test cases or goldens.

<Tabs>
<TabItem value="confident-ai" label="Confident AI">

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")
```

</TabItem>
<TabItem value="csv" label="From CSV">

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()

dataset.add_goldens_from_csv_file(
    # file_path is the absolute path to your .csv file
    file_path="example.csv",
    input_col_name="query"
)
```

</TabItem>
<TabItem value="json" label="From JSON">

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()

dataset.add_goldens_from_json_file(
    # file_path is the absolute path to your .json file
    file_path="example.json",
    input_key_name="query"
)
```

</TabItem>
</Tabs>

You can [learn more about loading datasets here](/docs/evaluation-datasets#load-dataset).

</TimelineItem>
<TimelineItem title="Create test cases using dataset">

You can now create `LLMTestCase`s using the goldens by calling your LLM application.

```python title="main.py"
from your_agent import your_llm_app # Replace with your LLM app
from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

dataset = EvaluationDataset()

test_cases = []

# Create test cases from goldens
for golden in dataset.goldens:
    res, text_chunks = your_llm_app(golden.input)
    test_case = LLMTestCase(input=golden.input, actual_output=res, retrieval_context=text_chunks)
    test_cases.append(test_case)
```

You can also add test cases directly into your dataset by using the `add_test_case()` method.

</TimelineItem>
<TimelineItem title="Run end-to-end evals">

You should pass the `test_cases` and `metrics` you've decided in the `evaluate()` function to run end-to-end evals.

```python title="main.py" {}
from your_agent import your_llm_app # Replace with your LLM app
from deepeval.metrics import AnswerRelevancyMetric
from deepeval import evaluate
...

evaluate(
    test_cases=test_cases,
    metrics=[AnswerRelevancyMetric()],
    hyperparameters={
        model="gpt-4.1",
        system_prompt="..."
    }
)
```

There are **TWO** mandatory and **SIX** optional parameters when calling the `evaluate()` function for **END-TO-END** evaluation:

- `test_cases`: a list of `LLMTestCase`s **OR** `ConversationalTestCase`s, or an `EvaluationDataset`. You cannot evaluate `LLMTestCase`/`MLLMTestCase`s and `ConversationalTestCase`s in the same test run.
- `metrics`: a list of metrics of type `BaseMetric`.
- [Optional] `hyperparameters`: a dict of type `dict[str, Union[str, int, float]]`. You can log any arbitrary hyperparameter associated with this test run to pick the best hyperparameters for your LLM application on Confident AI.
- [Optional] `identifier`: a string that allows you to better identify your test run on Confident AI.
- [Optional] `async_config`: an instance of type `AsyncConfig` that allows you to [customize the degree of concurrency](/docs/evaluation-flags-and-configs#async-configs) during evaluation. Defaulted to the default `AsyncConfig` values.
- [Optional] `display_config`:an instance of type `DisplayConfig` that allows you to [customize what is displayed](/docs/evaluation-flags-and-configs#display-configs) to the console during evaluation. Defaulted to the default `DisplayConfig` values.
- [Optional] `error_config`: an instance of type `ErrorConfig` that allows you to [customize how to handle errors](/docs/evaluation-flags-and-configs#error-configs) during evaluation. Defaulted to the default `ErrorConfig` values.
- [Optional] `cache_config`: an instance of type `CacheConfig` that allows you to [customize the caching behavior](/docs/evaluation-flags-and-configs#cache-configs) during evaluation. Defaulted to the default `CacheConfig` values.

This is exactly the same as `assert_test()` in `deepeval test run`, but in a different interface.

</TimelineItem>
</Timeline>

:::tip
We recommend logging your `hyperparameters` during your evauations as they allow you find the best model configuration for your application.

<VideoDisplayer
  src="https://confident-docs.s3.us-east-1.amazonaws.com/evaluation:parameter-insights.mp4"
  confidentUrl="https://www.confident-ai.com/docs/llm-evaluation/dashboards/model-and-prompt-insights"
  label="Parameter Insights To Find Best Model"
/>
:::

## Multi-Turn E2E Evals

<Timeline>
<TimelineItem title="Wrap chatbot in callback">

You need to define a chatbot callback to generate synthetic test cases from goldens using the `ConversationSimulator`. So, define a callback function to generate the **next chatbot response** in a conversation, given the conversation history.

<Tabs groupId="techstack">
<TabItem value="python" label="Python">

```python title="main.py" showLineNumbers={true}  "
from deepeval.test_case import Turn

async def model_callback(input: str, turns: List[Turn], thread_id: str) -> Turn:
    # Replace with your chatbot
    response = await your_chatbot(input, turns, thread_id)
    return Turn(role="assistant", content=response)
```

</TabItem>
<TabItem value="openai" label="OpenAI">

```python title=main.py showLineNumbers={true} {6}
from deepeval.test_case import Turn
from openai import OpenAI

client = OpenAI()

async def model_callback(input: str, turns: List[Turn]) -> str:
    messages = [
        {"role": "system", "content": "You are a ticket purchasing assistant"},
        *[{"role": t.role, "content": t.content} for t in turns],
        {"role": "user", "content": input},
    ]
    response = await client.chat.completions.create(model="gpt-4.1", messages=messages)
    return Turn(role="assistant", content=response.choices[0].message.content)
```

</TabItem>
<TabItem value="langchain" label="LangChain">

```python title=main.py showLineNumbers={true} {11}
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory

store = {}
llm = ChatOpenAI(model="gpt-4")
prompt = ChatPromptTemplate.from_messages([("system", "You are a ticket purchasing assistant."), MessagesPlaceholder(variable_name="history"), ("human", "{input}")])
chain_with_history = RunnableWithMessageHistory(prompt | llm, lambda session_id: store.setdefault(session_id, ChatMessageHistory()), input_messages_key="input", history_messages_key="history")

async def model_callback(input: str, thread_id: str) -> Turn:
    response = chain_with_history.invoke(
        {"input": input},
        config={"configurable": {"session_id": thread_id}}
    )
    return Turn(role="assistant", content=response.content)
```

</TabItem>
<TabItem value="llama_index" label="LlamaIndex">

```python title="main.py"  showLineNumbers={true} {9}
from llama_index.core.storage.chat_store import SimpleChatStore
from llama_index.llms.openai import OpenAI
from llama_index.core.chat_engine import SimpleChatEngine
from llama_index.core.memory import ChatMemoryBuffer

chat_store = SimpleChatStore()
llm = OpenAI(model="gpt-4")

async def model_callback(input: str, thread_id: str) -> Turn:
    memory = ChatMemoryBuffer.from_defaults(chat_store=chat_store, chat_store_key=thread_id)
    chat_engine = SimpleChatEngine.from_defaults(llm=llm, memory=memory)
    response = chat_engine.chat(input)
    return Turn(role="assistant", content=response.response)
```

</TabItem>
<TabItem value="openai-agents" label="OpenAI Agents">

```python title="main.py" showLineNumbers={true} {6}
from agents import Agent, Runner, SQLiteSession

sessions = {}
agent = Agent(name="Test Assistant", instructions="You are a helpful assistant that answers questions concisely.")

async def model_callback(input: str, thread_id: str) -> Turn:
    if thread_id not in sessions:
        sessions[thread_id] = SQLiteSession(thread_id)
    session = sessions[thread_id]
    result = await Runner.run(agent, input, session=session)
    return Turn(role="assistant", content=result.final_output)
```

</TabItem>
<TabItem value="pydantic" label="Pydantic">

```python title="main.py" showLineNumbers={true} {9}
from pydantic_ai.messages import ModelRequest, ModelResponse, UserPromptPart, TextPart
from deepeval.test_case import Turn
from datetime import datetime
from pydantic_ai import Agent
from typing import List

agent = Agent('openai:gpt-4', system_prompt="You are a helpful assistant that answers questions concisely.")

async def model_callback(input: str, turns: List[Turn]) -> Turn:
    message_history = []
    for turn in turns:
        if turn.role == "user":
            message_history.append(ModelRequest(parts=[UserPromptPart(content=turn.content, timestamp=datetime.now())], kind='request'))
        elif turn.role == "assistant":
            message_history.append(ModelResponse(parts=[TextPart(content=turn.content)], model_name='gpt-4', timestamp=datetime.now(), kind='response'))
    result = await agent.run(input, message_history=message_history)
    return Turn(role="assistant", content=result.output)
```

</TabItem>
</Tabs>

:::info
Your model callback should accept an `input`, and optionally `turns` and `thread_id`. It should return a `Turn` object.
:::

</TimelineItem>
<TimelineItem title="Load your dataset">

`deepeval` offers support for loading datasets stored in JSON files, CSV files, and hugging face datasets into an `EvaluationDataset` as either test cases or goldens.

<Tabs>
<TabItem value="confident-ai" label="Confident AI">

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")
```

</TabItem>

<TabItem value="from-json" label="From JSON">

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()

dataset.add_goldens_from_json_file(
    # file_path is the absolute path to your .json file
    file_path="example.json",
    input_key_name="query"
)
```

</TabItem>

<TabItem value="from-csv" label="From CSV">

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()

dataset.add_goldens_from_csv_file(
    # file_path is the absolute path to your .csv file
    file_path="example.csv",
    input_col_name="query"
)
```

</TabItem>
</Tabs>

You can [learn more about loading datasets here](/docs/evaluation-datasets#load-dataset).

</TimelineItem>
<TimelineItem title="Simulate turns">

Use `deepeval`'s `ConversationSimulator` to simulate turns using goldens in your dataset:

```python title="main.py"
from deepeval.conversation_simulator import ConversationSimulator

simulator = ConversationSimulator(model_callback=chatbot_callback)
conversational_test_cases = simulator.simulate(goldens=dataset.goldens, max_turns=10)
```

Here, we only have 1 test case, but in reality you'll want to simulate from at least 20 goldens.

<details>
<summary>Click to view an example simulated test case</summary>

Your generated test cases should be populated with simulated `Turn`s, along with the `scenario`, `expected_outcome`, and `user_description` from the conversation golden.

```python
ConversationalTestCase(
    scenario="Andy Byron wants to purchase a VIP ticket to a Coldplay concert.",
    expected_outcome="Successful purchase of a ticket.",
    user_description="Andy Byron is the CEO of Astronomer.",
    turns=[
        Turn(role="user", content="Hello, how are you?"),
        Turn(role="assistant", content="I'm doing well, thank you!"),
        Turn(role="user", content="How can I help you today?"),
        Turn(role="assistant", content="I'd like to buy a ticket to a Coldplay concert."),
    ]
)
```

</details>

</TimelineItem>

<TimelineItem title="Run an evaluation">

Run an evaluation like how you learnt in the previous section:

```python title="main.py"
from deepeval.metrics import TurnRelevancyMetric
from deepeval import evaluate
...

evaluate(
  conversational_test_cases,
  metrics=[TurnRelevancyMetric()],
  hyperparameters={
      model="gpt-4.1",
      system_prompt="..."
  }
)
```

There are **TWO** mandatory and **SIX** optional parameters when calling the `evaluate()` function for **END-TO-END** evaluation:

- `test_cases`: a list of `LLMTestCase`s **OR** `ConversationalTestCase`s, or an `EvaluationDataset`. You cannot evaluate `LLMTestCase`/`MLLMTestCase`s and `ConversationalTestCase`s in the same test run.
- `metrics`: a list of metrics of type `BaseConversationalMetric`.
- [Optional] `hyperparameters`: a dict of type `dict[str, Union[str, int, float]]`. You can log any arbitrary hyperparameter associated with this test run to pick the best hyperparameters for your LLM application on Confident AI.
- [Optional] `identifier`: a string that allows you to better identify your test run on Confident AI.
- [Optional] `async_config`: an instance of type `AsyncConfig` that allows you to [customize the degree of concurrency](/docs/evaluation-flags-and-configs#async-configs) during evaluation. Defaulted to the default `AsyncConfig` values.
- [Optional] `display_config`:an instance of type `DisplayConfig` that allows you to [customize what is displayed](/docs/evaluation-flags-and-configs#display-configs) to the console during evaluation. Defaulted to the default `DisplayConfig` values.
- [Optional] `error_config`: an instance of type `ErrorConfig` that allows you to [customize how to handle errors](/docs/evaluation-flags-and-configs#error-configs) during evaluation. Defaulted to the default `ErrorConfig` values.
- [Optional] `cache_config`: an instance of type `CacheConfig` that allows you to [customize the caching behavior](/docs/evaluation-flags-and-configs#cache-configs) during evaluation. Defaulted to the default `CacheConfig` values.

This is exactly the same as `assert_test()` in `deepeval test run`, but in a difference interface.

</TimelineItem>
</Timeline>

We highly recommend setting up [Confident AI](https://app.confident-ai.com) with your `deepeval` evaluations to get professional test reports and observe trends of your LLM application's performance overtime like this:

<VideoDisplayer
  src="https://confident-docs.s3.us-east-1.amazonaws.com/evaluation:multi-turn-e2e-report.mp4"
  confidentUrl="https://www.confident-ai.com/docs/llm-evaluation/dashboards/testing-reports"
  label="Test Reports After Running Evals on Confident AI"
/>

## E2E Evals For Tracing

If you've [setup tracing](/docs/evaluation-llm-tracing) for you LLM application, you can run end-to-end evals using the `evals_iterator()` function.

<Timeline>
<TimelineItem title="Load your dataset">

`deepeval` offers support for loading datasets stored in JSON files, CSV files, and hugging face datasets into an `EvaluationDataset` as either test cases or goldens.

<Tabs>
<TabItem value="confident-ai" label="Confident AI">

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="My Evals Dataset")
```

</TabItem>
<TabItem value="csv" label="From CSV">

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()

dataset.add_goldens_from_csv_file(
    # file_path is the absolute path to your .csv file
    file_path="example.csv",
    input_col_name="query"
)
```

</TabItem>
<TabItem value="json" label="From JSON">

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()

dataset.add_goldens_from_json_file(
    # file_path is the absolute path to your .json file
    file_path="example.json",
    input_key_name="query"
)
```

</TabItem>
</Tabs>

You can [learn more about loading datasets here](/docs/evaluation-datasets#load-dataset).

</TimelineItem>

<TimelineItem title="Update your test cases for trace">

You can update your end-to-end test cases for trace by using the `update_current_trace` function provided by `deepeval`

```python {2,10,17}
from openai import OpenAI
from deepeval.tracing import observe, update_current_trace

@observe()
def llm_app(query: str) -> str:

    @observe()
    def retriever(query: str) -> list[str]:
        chunks = ["List", "of", "text", "chunks"]
        update_current_trace(retrieval_context=chunks)
        return chunks

    @observe()
    def generator(query: str, text_chunks: list[str]) -> str:
        res = OpenAI().chat.completions.create(model="gpt-4o", messages=[{"role": "user", "content": query}]
        ).choices[0].message.content
        update_current_trace(input=query, output=res)
        return res

    return generator(query, retriever(query))
```

There are **TWO** ways to create test cases when using the `update_current_trace` function:

- [Optional] `test_case`: Takes an `LLMTestCase` to create a span level test case for that component.

- Or, You can also opt to give the values of `LLMTestCase` directly by using the following attributes:
  - [Optional] `input`
  - [Optional] `output`
  - [Optional] `retrieval_context`
  - [Optional] `context`
  - [Optional] `expected_output`
  - [Optional] `tools_called`
  - [Optional] `expected_tools`

:::note
You can use the individual `LLMTestCase` params in the `update_current_trace` function to override the values of the `test_case` you passed.
:::

</TimelineItem>
<TimelineItem title="Run end-to-end evals">

You can run end-to-end evals for your traces by supplying your `metrics` in the `evals_iterator` function.

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.pull(alias="YOUR-DATASET-ALIAS")

for golden in dataset.evals_iterator(metrics=[AnswerRelevancyMetric()]):
    llm_app(golden.input) # Replace with your LLM app
```

There are **SIX** optional parameters when using the `evals_iterator()`:

- [Optional] `metrics`: a list of `BaseMetric` that allows you to run end-to-end evals for your traces.
- [Optional] `identifier`: a string that allows you to better identify your test run on Confident AI.
- [Optional] `async_config`: an instance of type `AsyncConfig` that allows you to [customize the degree concurrency](/docs/evaluation-flags-and-configs#async-configs) during evaluation. Defaulted to the default `AsyncConfig` values.
- [Optional] `display_config`:an instance of type `DisplayConfig` that allows you to [customize what is displayed](/docs/evaluation-flags-and-configs#display-configs) to the console during evaluation. Defaulted to the default `DisplayConfig` values.
- [Optional] `error_config`: an instance of type `ErrorConfig` that allows you to [customize how to handle errors](/docs/evaluation-flags-and-configs#error-configs) during evaluation. Defaulted to the default `ErrorConfig` values.
- [Optional] `cache_config`: an instance of type `CacheConfig` that allows you to [customize the caching behavior](/docs/evaluation-flags-and-configs#cache-configs) during evaluation. Defaulted to the default `CacheConfig` values.

This is all it takes to run end-to-end evaluations, with the added benefit of a full testing report with tracing included on Confident AI.

<VideoDisplayer
  src="https://confident-docs.s3.us-east-1.amazonaws.com/evaluation:single-turn-e2e-report-tracing.mp4"
  confidentUrl="https://www.confident-ai.com/docs/llm-evaluation/dashboards/testing-reports"
  label="Test Reports For Evals and Traces on Confident AI"
/>

</TimelineItem>
</Timeline>

If you want to run end-to-end evaluations in CI/CD piplines, [click here](/docs/evaluation-unit-testing-in-ci-cd#end-to-end-evals-in-cicd).
